A Rule based Stemming Method for Multilingual Urdu Text
Abstract
Urdu is the national language of Pakistan, and more than 200 million people use it for spoken and written communication. A large amount of unstructured Urdu text exists, from which data mining techniques could extract useful information. However, Urdu seriously lacks the processing capabilities needed to develop innovative language-based systems. In this paper, the authors present a rule-based stemming method for Urdu that can cope with the challenges of Urdu infix stemming. The proposed method generates the stem of an Urdu word by removing its prefix, infix, and postfix. The authors introduce two novel classes of Urdu infix words and a new minimum word length rule, and they develop infix stripping rules to generate the stems of words that belong to the proposed infix classes. The technique can also generate the stems of borrowed words and compound words. The proposed approach is evaluated on Urdu headline news datasets. It is compared with an existing state-of-the-art technique (A Light Weight Urdu Stemmer) to ...
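The prefix and postfix removal with a minimum word length rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' rule set: the affix lists, the threshold value, and the function name `strip_affixes` are all assumptions chosen for illustration, and infix stripping is left out because the abstract does not specify the infix classes or their rules.

```python
# Illustrative affix lists (assumptions, not the paper's actual rules).
PREFIXES = ("بے", "نا")
# Suffixes are ordered longest first so the longest match is tried first.
SUFFIXES = ("یاں", "وں", "ے")
# Hypothetical threshold: never leave a stem shorter than this many characters.
MIN_STEM_LENGTH = 2


def strip_affixes(word, prefixes=PREFIXES, suffixes=SUFFIXES,
                  min_len=MIN_STEM_LENGTH):
    """Remove at most one prefix and one suffix, respecting a minimum length."""
    stem = word

    # Strip the first matching prefix, if enough characters remain.
    for prefix in prefixes:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_len:
            stem = stem[len(prefix):]
            break

    # Strip the first matching suffix, if enough characters remain.
    for suffix in suffixes:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_len:
            stem = stem[:-len(suffix)]
            break

    return stem
```

In this sketch the minimum length check is what keeps very short words intact: an affix is removed only when at least `min_len` characters would remain afterwards.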
Similar resources
A Light Weight Stemmer for Urdu Language: A Scarce Resourced Language
Stemming is a procedure that conflates morphologically related terms into a single term without performing a complete morphological analysis. The Urdu language raises several challenges for Natural Language Processing (NLP), largely due to its rich morphology. The core tool of information retrieval (IR) is a stemmer, which reduces a word to its stem form. Due to the diverse nature of Urdu, developing its ste...
Rule Based Urdu Stemmer
This paper presents a rule-based Urdu stemmer. In this technique, rules are applied to remove suffixes and prefixes from inflected words. Urdu is widely spoken all over the world, but little work has been done on Urdu stemming. A stemmer helps to find the root of an inflected word. Various possibilities of inflected words like وں (vao+noon-gunna), ے (badi-ye), یاں (choti-ye+alif+noon-gunna) ...
Automatic Diacritization for Urdu
Urdu language is written in Arabic script. In this script, the consonantal context is clearly represented, but the vocalic sounds are represented (mostly) by marks or diacritics, which are optional and normally not written. Readers can guess the diacritics and thus can pronounce words correctly, based on their knowledge of the language. But un-diacritized Urdu text creates ambiguity for novice ...
Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization
Preprocessing is a preliminary step in many fields, including IR and NLP. The effect of basic preprocessing settings on English text summarization is well studied. However, to the best of our knowledge, no such effort exists for the Urdu language. In this study, we analyze the effect of basic preprocessing settings for single-document text summarization for Urdu, on a benchmark co...
Rule-Based Named Entity Recognition in Urdu
Named Entity Recognition or Extraction (NER) is an important task for automated text processing for industries and academia engaged in the field of language processing, intelligence gathering and Bioinformatics. In this paper we discuss the general problem of Named Entity Recognition, more specifically the challenges in NER in languages that do not have language resources e.g. large annotated c...